Abstract


The  learning  of  probabilistic  models  with  many  hidden  variables  and  non- 
decomposable  dependencies  is  an  important  and  challenging  problem.  In  contrast 
to  traditional  approaches  based  on  approximate  inference  in  a  single  intractable 
model,  our  approach  is  to  train  a  set  of  tractable  submodels  by  encouraging  them 
to  agree  on  the  hidden  variables.  This  allows  us  to  capture  non-decomposable 
aspects  of  the  data  while  still  maintaining  tractability.  We  propose  an  objective 
function  for  our  approach,  derive  EM-style  algorithms  for  parameter  estimation, 
and  demonstrate  their  effectiveness  on  three  challenging  real-world  learning  tasks. 


1  Introduction 


Many  problems  in  natural  language,  vision,  and  computational  biology  require  the  joint  modeling  of 
many  dependent  variables.  Such  models  often  include  hidden  variables,  which  play  an  important  role 
in  unsupervised  learning  and  general  missing  data  problems.  The  focus  of  this  paper  is  on  models 
in  which  the  hidden  variables  have  natural  problem  domain  interpretations  and  are  the  object  of 
inference. 

Standard approaches for learning hidden-variable models involve integrating out the hidden variables and working with the resulting marginal likelihood. However, this marginalization can be intractable. An alternative is to develop procedures that merge the inference results of several tractable submodels. An early example of such an approach is the use of pseudolikelihood [1], which deals with many conditional models of single variables rather than a single joint model. More generally, composite likelihood permits a combination of the likelihoods of subsets of variables [7]. Another approach is piecewise training [10, 11], which has been applied successfully to several large-scale learning problems.

All  of  the  above  methods,  however,  focus  on  fully-observed  models.  In  the  current  paper,  we  develop 
techniques  in  this  spirit  that  work  for  hidden-variable  models.  The  basic  idea  of  our  approach  is  to 
create  several  tractable  submodels  and  train  them  jointly  to  agree  on  their  hidden  variables.  We 
present  an  intuitive  objective  function  and  efficient  EM-style  algorithms  for  training  a  collection  of 
submodels.  We  refer  to  this  general  approach  as  agreement-based  learning. 

Sections 2 and 3 present the general theory for agreement-based learning. In some applications, it is computationally infeasible to optimize the objective function; Section 4 provides two alternative objectives that lead to tractable algorithms. Section 5 demonstrates that our methods can be applied successfully to large datasets in three real-world problem domains: grammar induction, word alignment, and phylogenetic hidden Markov modeling.

2  Agreement-based learning of multiple submodels


Assume we have $M$ (sub)models $p_m(x, z; \theta_m)$, $m = 1, \dots, M$, where each submodel specifies a distribution over the observed data $x \in \mathcal{X}$ and some hidden state $z \in \mathcal{Z}$. The submodels could be parametrized in completely different ways as long as they are defined on the common event space $\mathcal{X} \times \mathcal{Z}$. Intuitively, each submodel should capture a different aspect of the data in a tractable way.

To learn these submodels, the simplest approach is to train them independently by maximizing the sum of their log-likelihoods:

$$\mathcal{O}_{\text{indep}}(\theta) \overset{\text{def}}{=} \log \prod_m \sum_z p_m(x, z; \theta_m) = \sum_m \log p_m(x; \theta_m), \qquad (1)$$

where $\theta = (\theta_1, \dots, \theta_M)$ is the collective set of parameters and $p_m(x; \theta_m) = \sum_z p_m(x, z; \theta_m)$ is the likelihood under submodel $p_m$.¹ Given an input $x$, we can then produce an output $z$ by combining the posteriors $p_m(z \mid x; \theta_m)$ of the trained submodels.

If we view each submodel as trying to solve the same task of producing the desired posterior over $z$, then it seems advantageous to train the submodels jointly to encourage “agreement on $z$.” We propose the following objective which realizes this insight:

$$\mathcal{O}_{\text{agree}}(\theta) \overset{\text{def}}{=} \log \sum_z \prod_m p_m(x, z; \theta_m) = \sum_m \log p_m(x; \theta_m) + \log \sum_z \prod_m p_m(z \mid x; \theta_m). \qquad (2)$$

The last term rewards parameter values $\theta$ for which the submodels assign probability mass to the same $z$ (conditioned on $x$); the summation over $z$ reflects the fact that we do not know what $z$ is.
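The decomposition in (2) can be checked numerically on a toy problem. The sketch below assumes two hypothetical submodels stored as small joint tables over two observations and three hidden states; the table values and function names are illustrative and do not come from the paper.

```python
import math

# Hypothetical toy setup: two submodels over X = {0, 1} and Z = {0, 1, 2},
# each stored as a joint table p[x][z] whose entries sum to 1.
p1 = [[0.30, 0.10, 0.10], [0.05, 0.25, 0.20]]
p2 = [[0.25, 0.15, 0.05], [0.10, 0.30, 0.15]]
submodels = [p1, p2]

def o_indep(models, x):
    """Sum over submodels of log p_m(x); z is marginalized per submodel."""
    return sum(math.log(sum(p[x])) for p in models)

def o_agree(models, x):
    """log sum_z prod_m p_m(x, z): every submodel must use the same z."""
    n_z = len(models[0][x])
    return math.log(sum(math.prod(p[x][z] for p in models) for z in range(n_z)))

def agreement_term(models, x):
    """log sum_z prod_m p_m(z | x), the last term of (2)."""
    n_z = len(models[0][x])
    total = 0.0
    for z in range(n_z):
        total += math.prod(p[x][z] / sum(p[x]) for p in models)
    return math.log(total)

x = 0
lhs = o_agree(submodels, x)
rhs = o_indep(submodels, x) + agreement_term(submodels, x)
```

Here `lhs` and `rhs` coincide up to floating-point error, and `agreement_term` is never positive because the product of posteriors summed over $z$ is at most 1.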


$\mathcal{O}_{\text{agree}}$ has a natural probabilistic interpretation. Imagine defining a joint distribution over $M$ independent copies of the data and hidden state, $(x_1, z_1), \dots, (x_M, z_M)$, which are each generated by a different submodel: $p((x_1, z_1), \dots, (x_M, z_M); \theta) = \prod_m p_m(x_m, z_m; \theta_m)$. Then $\mathcal{O}_{\text{agree}}$ is the log-probability that the submodels all generate the same observed data $x$ and the same hidden state: $\log p(x_1 = \cdots = x_M = x, z_1 = \cdots = z_M; \theta)$.


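$\mathcal{O}_{\text{agree}}$ is also related to the likelihood of a proper probabilistic model $p_{\text{norm}}$, obtained by normalizing the product of the submodels, as is done in [3]. Our objective $\mathcal{O}_{\text{agree}}$ is then a lower bound on the log-likelihood under $p_{\text{norm}}$:

$$\log p_{\text{norm}}(x; \theta) \overset{\text{def}}{=} \log \frac{\sum_z \prod_m p_m(x, z; \theta_m)}{\sum_{x, z} \prod_m p_m(x, z; \theta_m)} \ge \log \frac{\sum_z \prod_m p_m(x, z; \theta_m)}{\prod_m \sum_{x, z} p_m(x, z; \theta_m)} = \mathcal{O}_{\text{agree}}(\theta). \qquad (3)$$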


The inequality holds because the denominator of the lower bound contains additional cross terms. The bound is generally loose, but becomes tighter as each $p_m$ becomes more deterministic. Note that $p_{\text{norm}}$ is distinct from the product-of-experts model [3], in which each “expert” model $p_m$ has its own set of (nuisance) hidden variables: $p_{\text{poe}}(x) \propto \prod_m \sum_z p_m(x, z; \theta_m)$. In contrast, $p_{\text{norm}}$ has one set of hidden variables $z$ common to all submodels, which is what provides the mechanism for agreement-based learning.
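To make the bound concrete, the following sketch compares the log-likelihood under the normalized product with the agreement objective on hypothetical joint tables. The tables and helper names are assumptions made for illustration only.

```python
import math

# Hypothetical joint tables p[x][z] over X = {0, 1} and Z = {0, 1, 2}.
p1 = [[0.30, 0.10, 0.10], [0.05, 0.25, 0.20]]
p2 = [[0.25, 0.15, 0.05], [0.10, 0.30, 0.15]]
submodels = [p1, p2]

def agree_mass(models, x):
    """sum_z prod_m p_m(x, z), i.e. exp(O_agree) for the data point x."""
    n_z = len(models[0][x])
    return sum(math.prod(p[x][z] for p in models) for z in range(n_z))

def log_p_norm(models, x):
    """Log-likelihood of x under the normalized product of the submodels."""
    normalizer = sum(agree_mass(models, xp) for xp in range(len(models[0])))
    return math.log(agree_mass(models, x) / normalizer)

def o_agree(models, x):
    """The agreement objective for a single data point x."""
    return math.log(agree_mass(models, x))

# The gap is minus the log of the normalizer, so it is the same for every x.
gaps = [log_p_norm(submodels, x) - o_agree(submodels, x) for x in (0, 1)]
```

The gap is nonnegative because the normalizer of the product is at most the product of the separate normalizers, each of which equals 1. It shrinks toward zero only as the tables concentrate their mass on the same $(x, z)$ cells.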


2.1  The  product  EM  algorithm 

We now derive the product EM algorithm to maximize $\mathcal{O}_{\text{agree}}$. Product EM bears many striking similarities to EM: both are coordinate-wise ascent algorithms on an auxiliary function and both increase the original objective monotonically. By introducing an auxiliary distribution $q(z)$ and applying Jensen’s inequality, we can lower bound $\mathcal{O}_{\text{agree}}$ with an auxiliary function $\mathcal{L}$:

$$\mathcal{O}_{\text{agree}}(\theta) = \log \sum_z q(z) \frac{\prod_m p_m(x, z; \theta_m)}{q(z)} \ge E_{q(z)} \log \frac{\prod_m p_m(x, z; \theta_m)}{q(z)} \overset{\text{def}}{=} \mathcal{L}(\theta, q). \qquad (4)$$

The product EM algorithm performs coordinate-wise ascent on $\mathcal{L}(\theta, q)$. In the (product) E-step, we optimize $\mathcal{L}$ with respect to $q$. Simple algebra reveals that this optimization is equivalent to minimizing a KL-divergence: $\mathcal{L}(\theta, q) = -\mathrm{KL}(q(z) \,\|\, \prod_m p_m(x, z; \theta_m)) + \text{constant}$, where the constant does not depend on $q$. The KL-divergence is minimized by setting $q(z) \propto \prod_m p_m(x, z; \theta_m)$. In the (product) M-step, we optimize $\mathcal{L}$ with respect to $\theta$, which decomposes into $M$ independent objectives: $\mathcal{L}(\theta, q) = \sum_m E_{q(z)} \log p_m(x, z; \theta_m) + \text{constant}$, where this constant does not depend on $\theta$. Each term corresponds to an independent M-step, just as in EM for maximizing $\mathcal{O}_{\text{indep}}$.

¹ To simplify notation, we consider one data point $x$. Extending to a set of i.i.d. points is straightforward.

Thus,  our  product  EM  algorithm  differs  from  independent  EM  only  in  the  E-step,  in  which  the 
submodels  are  multiplied  together  to  produce  one  posterior  over  z  rather  than  M  separate  ones. 
Assuming  that  there  is  an  efficient  EM  algorithm  for  each  submodel  pm,  there  is  no  difficulty  in 
performing  the  product  M-step.  In  our  applications  (Section  5),  each  pm  is  composed  of  multinomial 
distributions,  so  the  M-step  simply  involves  computing  ratios  of  expected  counts.  On  the  other  hand, 
the  product  E-step  can  become  intractable  and  we  must  develop  approximations  (Section  4). 

3  Exponential  family  formulation 

Thus far, we have placed no restrictions on the form of the submodels. To develop a richer understanding and provide a framework for making approximations, we now assume that each submodel $p_m$ is an exponential family distribution:

$$p_m(x, z; \theta_m) = \exp\{\theta_m^\top \phi_m(x, z) - A_m(\theta_m)\} \text{ for } x \in \mathcal{X}, z \in \mathcal{Z}_m \text{ and 0 otherwise}, \qquad (5)$$

where $\phi_m$ are sufficient statistics (features) and $A_m(\theta_m) = \log \sum_{x \in \mathcal{X}} \sum_{z \in \mathcal{Z}_m} \exp\{\theta_m^\top \phi_m(x, z)\}$ is the log-partition function,² defined on $\theta_m \in \Theta_m \subset \mathbb{R}^J$. We can think of all the submodels $p_m$ as being defined on a common space $\mathcal{Z}_\cup = \cup_m \mathcal{Z}_m$, but the support of $q(z)$ as computed in the E-step is only the intersection $\mathcal{Z}_\cap = \cap_m \mathcal{Z}_m$. Controlling this support will be essential in developing tractable approximations (Section 4.1).

In the general formulation, we required only that the submodels share the same event space $\mathcal{X} \times \mathcal{Z}$. Now we make explicit the possibility of the submodels sharing features, which gives us more structure for deriving approximations. In particular, suppose each feature $j$ of submodel $p_m$ can be decomposed into a part that depends on $x$ (which is specific to that particular submodel) and a part that depends on $z$ (which is the same for all submodels):

$$\phi_{mj}(x, z) = \sum_{i=1}^{I} \phi^X_{mji}(x)\, \phi^Z_i(z), \quad \text{or in matrix notation,} \quad \phi_m(x, z) = \phi^X_m(x)\, \phi^Z(z), \qquad (6)$$

where $\phi^X_m(x)$ is a $J \times I$ matrix and $\phi^Z(z)$ is an $I \times 1$ vector. When $z$ is discrete, such a decomposition always exists by defining $\phi^Z(z)$ to be a $|\mathcal{Z}_\cup|$-dimensional indicator vector which is 1 on the component corresponding to $z$. Fortunately, we can usually obtain more compact representations of $\phi^Z(z)$. We can now express our objective $\mathcal{L}(\theta, q)$ (4) using (5) and (6):

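$$\mathcal{L}(\theta, q) = \sum_m \theta_m^\top \phi^X_m(x) \left( E_{q(z)} \phi^Z(z) \right) + H(q) - \sum_m A_m(\theta_m) \text{ for } q \in \mathcal{Q}(\mathcal{Z}_\cap), \qquad (7)$$

where $\mathcal{Q}(\mathcal{Z}') \overset{\text{def}}{=} \{q : q(z) = 0 \text{ for } z \notin \mathcal{Z}'\}$ is the set of distributions with support $\mathcal{Z}'$. For convenience, define $b_m^\top = \theta_m^\top \phi^X_m(x)$ and $b = \sum_m b_m$, which summarize the parameters $\theta$ for the E-step. Note that for any $\theta$, the $q$ maximizing $\mathcal{L}$ always has the following exponential family form:

$$q(z; \beta) = \exp\{\beta^\top \phi^Z(z) - A_{\mathcal{Z}_\cap}(\beta)\} \text{ for } z \in \mathcal{Z}_\cap \text{ and 0 otherwise}, \qquad (8)$$

where $A_{\mathcal{Z}_\cap}(\beta) = \log \sum_{z \in \mathcal{Z}_\cap} \exp\{\beta^\top \phi^Z(z)\}$ is the log-partition function. In a minor abuse of notation, we write $\mathcal{L}(\theta, \beta) = \mathcal{L}(\theta, q(\cdot; \beta))$. Specifically, $\mathcal{L}(\theta, \beta)$ is maximized by setting $\beta = b$.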
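For readers who want the intermediate steps, the expression (7) follows from (4) by substituting (5) and (6). The derivation below is a sketch that uses only symbols already defined in this section, with $b = \sum_m \phi^X_m(x)^\top \theta_m$ and $q$ supported on $\mathcal{Z}_\cap$.

```latex
\begin{align*}
\mathcal{L}(\theta, q)
  &= E_{q(z)} \Big[ \sum_m \log p_m(x, z; \theta_m) \Big] - E_{q(z)} \log q(z) \\
  &= \sum_m E_{q(z)} \Big[ \theta_m^\top \phi^X_m(x)\, \phi^Z(z) - A_m(\theta_m) \Big] + H(q) \\
  &= \sum_m \theta_m^\top \phi^X_m(x)\, E_{q(z)} \phi^Z(z) + H(q) - \sum_m A_m(\theta_m) \\
  &= b^\top E_{q(z)} \phi^Z(z) + H(q) - \sum_m A_m(\theta_m).
\end{align*}
% For fixed \theta, the first two terms are maximized over q \in \mathcal{Q}(\mathcal{Z}_\cap)
% by q(z) \propto \exp\{ b^\top \phi^Z(z) \} on \mathcal{Z}_\cap, which is (8) with \beta = b.
```

The last line makes visible why the E-step depends on the parameters only through the vector $b$.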

² Our applications use directed graphical models, which correspond to curved exponential families where each $\Theta_m$ is defined by local normalization constraints and $A_m(\theta_m) = 0$.

It will be useful to express (7) using convex duality [12]. The key idea of convex duality is the existence of a mapping between the canonical exponential parameters $\beta \in \mathbb{R}^I$ of an exponential family distribution $q(z; \beta)$ and the mean parameters defined by $\mu = E_{q(z;\beta)} \phi^Z(z) \in \mathcal{M}(\mathcal{Z}_\cap) \subset \mathbb{R}^I$, where $\mathcal{M}(\mathcal{Z}') = \{\mu : \exists q \in \mathcal{Q}(\mathcal{Z}') : E_q \phi^Z(z) = \mu\}$ is the set of realizable mean parameters. The Fenchel-Legendre conjugate of the log-partition function $A_{\mathcal{Z}_\cap}(\beta)$ is

$$A^*_{\mathcal{Z}_\cap}(\mu) \overset{\text{def}}{=} \sup_{\beta \in \mathbb{R}^I} \{\beta^\top \mu - A_{\mathcal{Z}_\cap}(\beta)\} \text{ for } \mu \in \mathcal{M}(\mathcal{Z}_\cap), \qquad (9)$$

which is also equal to $-H(q(z; \beta))$, the negative entropy of any distribution $q(z; \beta)$ corresponding to $\mu$. Substituting $\mu$ and $A^*_{\mathcal{Z}_\cap}(\mu)$ into (7), we obtain an objective in terms of the dual variables $\mu$:

$$\mathcal{L}^*(\theta, \mu) \overset{\text{def}}{=} \sum_m \theta_m^\top \phi^X_m(x)\, \mu - A^*_{\mathcal{Z}_\cap}(\mu) - \sum_m A_m(\theta_m) \text{ for } \mu \in \mathcal{M}(\mathcal{Z}_\cap). \qquad (10)$$

Note that the two objectives are equivalent: $\sup_{\beta \in \mathbb{R}^I} \mathcal{L}(\theta, \beta) = \sup_{\mu \in \mathcal{M}(\mathcal{Z}_\cap)} \mathcal{L}^*(\theta, \mu)$ for each $\theta$. The mean parameters $\mu$ are exactly the $z$-specific expected sufficient statistics computed in the product E-step. The dual is an attractive representation because it allows us to form convex combinations of different $\mu$, an operation that does not have a direct correlate in the primal formulation. The product EM algorithm is summarized below:


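Product EM

E-step: $\mu = \arg\max_{\mu' \in \mathcal{M}(\mathcal{Z}_\cap)} \{ b^\top \mu' - A^*_{\mathcal{Z}_\cap}(\mu') \}$

M-step: $\theta_m = \arg\max_{\theta'_m \in \Theta_m} \{ \theta'^\top_m \phi^X_m(x)\, \mu - A_m(\theta'_m) \}$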

4  Approximations 

The product M-step is tractable provided that the M-step for each submodel is tractable, which is generally the case. The corresponding statement is not true for the E-step, which in general requires explicitly summing over all possible $z \in \mathcal{Z}_\cap$, often an exponentially large set. We will thus consider alternative E-steps, so it will be convenient to succinctly characterize an E-step. An E-step is specified by a vector $b'$ (which depends on $\theta$ and $x$) and a set $\mathcal{Z}'$ (which we sum $z$ over):

$$E(b', \mathcal{Z}') \text{ computes } \mu = \arg\max_{\mu' \in \mathcal{M}(\mathcal{Z}')} \{ b'^\top \mu' - A^*_{\mathcal{Z}'}(\mu') \}. \qquad (11)$$

Using this notation, $E(b_m, \mathcal{Z}_m)$ is the E-step for training the $m$-th submodel independently using EM and $E(b, \mathcal{Z}_\cap)$ is the E-step of product EM. Though we write E-steps in the dual formulation, in practice, we compute $\mu$ as an expectation over all $z \in \mathcal{Z}'$, perhaps leveraging dynamic programming.
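The primal view of an E-step can be sketched by brute-force enumeration when the set is tiny. The function below is an illustrative assumption, not the paper's implementation: it takes a parameter vector, an explicit list of hidden states, and a feature map, and returns the expected features under the distribution proportional to the exponentiated score. The feature map and the two supports are hypothetical.

```python
import math
from itertools import product

def e_step(b, support, phi):
    """E(b', Z') by enumeration: mean of phi(z) under q(z) ~ exp(b . phi(z))."""
    scores = [sum(bi * fi for bi, fi in zip(b, phi(z))) for z in support]
    top = max(scores)                       # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    mu = [0.0] * len(b)
    for z, w in zip(support, weights):
        for i, fi in enumerate(phi(z)):
            mu[i] += (w / total) * fi
    return mu

def phi(z):
    """Hypothetical features of a two-bit hidden state: both bits and their AND."""
    return [z[0], z[1], z[0] * z[1]]

all_z = list(product([0, 1], repeat=2))          # a larger domain
restricted = [z for z in all_z if z[0] <= z[1]]  # a smaller domain inside it

mu_full = e_step([0.0, 0.0, 0.0], all_z, phi)
mu_restricted = e_step([0.0, 0.0, 0.0], restricted, phi)
```

With a zero parameter vector the distribution is uniform on the chosen support, so the two calls differ only through the set being summed over. This is the knob that the approximations in this section turn.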

If $E(b_m, \mathcal{Z}_m)$ is tractable and all submodels have the same dynamic programming structure (e.g., if $z$ is a tree and all features are local with respect to that tree), then $E(b, \mathcal{Z}_\cap)$ is also tractable: we can incorporate all the features into the same dynamic program and simply run product EM (see Section 5.1 for an example).

However, $E(b, \mathcal{Z}_\cap)$ is intractable in general, owing to two complications: (1) we can sum over each $\mathcal{Z}_m$ efficiently but not the intersection $\mathcal{Z}_\cap$; and (2) each $b_m$ corresponds to a decomposable graphical model, but the combined $b = \sum_m b_m$ corresponds to a loopy graph. In the sequel, we describe two approximate objective functions addressing each complication, whose maximization can be carried out by performing $M$ independent tractable E-steps.


4.1  Domain-approximate  product  EM 


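Assume that for each submodel $p_m$, $E(b, \mathcal{Z}_m)$ is tractable (see Section 5.2 for an example). We propose maximizing the following objective:

$$\mathcal{L}^*_{\text{dom}}(\theta, \mu_1, \dots, \mu_M) \overset{\text{def}}{=} \frac{1}{M} \sum_m \left[ \Big( \sum_{m'} \theta_{m'}^\top \phi^X_{m'}(x) \Big) \mu_m - A^*_{\mathcal{Z}_m}(\mu_m) \right] - \sum_m A_m(\theta_m), \qquad (12)$$

with each $\mu_m \in \mathcal{M}(\mathcal{Z}_m)$. This objective can be maximized via coordinate-wise ascent:

Domain-approximate product EM

E-step: $\mu_m = \arg\max_{\mu'_m \in \mathcal{M}(\mathcal{Z}_m)} \{ b^\top \mu'_m - A^*_{\mathcal{Z}_m}(\mu'_m) \}$   $[E(b, \mathcal{Z}_m)]$

M-step: $\theta_m = \arg\max_{\theta'_m \in \Theta_m} \{ \theta'^\top_m \phi^X_m(x) \big( \frac{1}{M} \sum_{m'} \mu_{m'} \big) - A_m(\theta'_m) \}$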


The product E-step consists of $M$ separate E-steps, which are each tractable because each involves the respective $\mathcal{Z}_m$ instead of $\mathcal{Z}_\cap$. The resulting expected sufficient statistics are averaged and used in the product M-step, which breaks down into $M$ separate M-steps.
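
While we have not yet established any relationship between our approximation $\mathcal{L}^*_{\text{dom}}$ and the original objective $\mathcal{L}^*$, we can, however, relate $\mathcal{L}^*_{\text{dom}}$ to $\mathcal{L}^*_\cup$, which is defined as an analogue of $\mathcal{L}^*$ by replacing $\mathcal{Z}_\cap$ with $\mathcal{Z}_\cup$ in (10).

Proposition 1. $\mathcal{L}^*_{\text{dom}}(\theta, \mu_1, \dots, \mu_M) \le \mathcal{L}^*_\cup(\theta, \bar{\mu})$ for all $\theta$ and $\mu_m \in \mathcal{M}(\mathcal{Z}_m)$ and $\bar{\mu} = \frac{1}{M} \sum_m \mu_m$.

Proof. First, since $\mathcal{M}(\mathcal{Z}_m) \subset \mathcal{M}(\mathcal{Z}_\cup)$ and $\mathcal{M}(\mathcal{Z}_\cup)$ is a convex set, $\bar{\mu} \in \mathcal{M}(\mathcal{Z}_\cup)$, so $\mathcal{L}^*_\cup(\theta, \bar{\mu})$ is well-defined. Subtracting the $\mathcal{Z}_\cup$ version of (10) from (12), we obtain $\mathcal{L}^*_{\text{dom}}(\theta, \mu_1, \dots, \mu_M) - \mathcal{L}^*_\cup(\theta, \bar{\mu}) = A^*_{\mathcal{Z}_\cup}(\bar{\mu}) - \frac{1}{M} \sum_m A^*_{\mathcal{Z}_m}(\mu_m)$. It suffices to show $A^*_{\mathcal{Z}_\cup}(\bar{\mu}) \le \frac{1}{M} \sum_m A^*_{\mathcal{Z}_\cup}(\mu_m) \le \frac{1}{M} \sum_m A^*_{\mathcal{Z}_m}(\mu_m)$. The first inequality follows from convexity of $A^*_{\mathcal{Z}_\cup}(\cdot)$. For the second inequality: since $\mathcal{Z}_m \subset \mathcal{Z}_\cup$, $A_{\mathcal{Z}_\cup}(\beta) \ge A_{\mathcal{Z}_m}(\beta)$ for every $\beta$; by inspecting (9), it follows that $A^*_{\mathcal{Z}_\cup}(\mu_m) \le A^*_{\mathcal{Z}_m}(\mu_m)$. □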


4.2  Parameter-approximate  product  EM 


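Now suppose that for each submodel $p_m$, $E(b_m, \mathcal{Z}_\cap)$ is tractable (see Section 5.3 for an example). We propose maximizing the following objective:

$$\mathcal{L}^*_{\text{par}}(\theta, \mu_1, \dots, \mu_M) \overset{\text{def}}{=} \sum_m \left[ \big( \theta_m^\top \phi^X_m(x) \big) \mu_m - \frac{1}{M} A^*_{\mathcal{Z}_\cap}(\mu_m) \right] - \sum_m A_m(\theta_m), \qquad (13)$$

with each $\mu_m \in \mathcal{M}(\mathcal{Z}_\cap)$. This objective can be maximized via coordinate-wise ascent, which again consists of $M$ separate E-steps $E(M b_m, \mathcal{Z}_\cap)$ and the same M-step as before:

Parameter-approximate product EM

E-step: $\mu_m = \arg\max_{\mu'_m \in \mathcal{M}(\mathcal{Z}_\cap)} \{ (M b_m)^\top \mu'_m - A^*_{\mathcal{Z}_\cap}(\mu'_m) \}$   $[E(M b_m, \mathcal{Z}_\cap)]$

M-step: $\theta_m = \arg\max_{\theta'_m \in \Theta_m} \{ \theta'^\top_m \phi^X_m(x) \big( \frac{1}{M} \sum_{m'} \mu_{m'} \big) - A_m(\theta'_m) \}$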
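One way to read the E-step with scaled parameters: multiplying the canonical parameters of a submodel by $M$ raises its posterior over the shared support to the $M$-th power and renormalizes. The sketch below illustrates this on a hypothetical three-state posterior that is assumed to be already restricted to the shared support; the numbers, feature vectors, and function names are invented for illustration.

```python
def sharpen(posterior, M):
    """Renormalize posterior(z)**M, the distribution behind E(M * b_m, Z)."""
    powered = [p ** M for p in posterior]
    total = sum(powered)
    return [p / total for p in powered]

def mean_parameters(q, features):
    """mu = E_q[phi(z)] for a list of feature vectors, one per hidden state."""
    dim = len(features[0])
    return [sum(qz * f[i] for qz, f in zip(q, features)) for i in range(dim)]

# Hypothetical posterior p_m(z | x) over three hidden states.
posterior_m = [0.5, 0.3, 0.2]
# Hypothetical shared feature vectors phi^Z(z), one row per hidden state.
features = [[1, 0], [0, 1], [1, 1]]

q_m = sharpen(posterior_m, M=2)
mu_m = mean_parameters(q_m, features)
```

The sharpened distribution puts more mass on the most probable state than the original posterior does. This is consistent with the remark at the end of this section that the approximations are expected to work best when posteriors are low-variance and unimodal.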


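We can show that the maximum value of $\mathcal{L}^*_{\text{par}}$ is at least that of $\mathcal{L}^*$, which leaves us maximizing an upper bound of $\mathcal{L}^*$. Although less logical than maximizing a lower bound, in Section 5.3, we show that our approach is nonetheless a reasonable approximation which importantly is tractable.

Proposition 2. $\max_{\mu_1 \in \mathcal{M}(\mathcal{Z}_\cap), \dots, \mu_M \in \mathcal{M}(\mathcal{Z}_\cap)} \mathcal{L}^*_{\text{par}}(\theta, \mu_1, \dots, \mu_M) \ge \max_{\mu \in \mathcal{M}(\mathcal{Z}_\cap)} \mathcal{L}^*(\theta, \mu)$.

Proof. From the definitions of $\mathcal{L}^*_{\text{par}}$ (13) and $\mathcal{L}^*$ (10), it is easy to see that $\mathcal{L}^*_{\text{par}}(\theta, \mu, \dots, \mu) = \mathcal{L}^*(\theta, \mu)$ for all $\mu \in \mathcal{M}(\mathcal{Z}_\cap)$. If we maximize $\mathcal{L}^*_{\text{par}}$ with $M$ distinct arguments, we cannot end up with a smaller value. □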

The product E-step could also be approximated by mean-field or loopy belief propagation variants. These methods and the two we propose all fall under the general variational framework for approximate inference [12]. The two approximations we developed have the advantage of permitting exact tractable solutions without resorting to expensive iterative methods which are only guaranteed to converge to a local optimum.

While we still lack a complete theory relating our approximations $\mathcal{L}^*_{\text{dom}}$ and $\mathcal{L}^*_{\text{par}}$ to the original objective $\mathcal{L}^*$, we can give some intuitions. Since we are operating in the space of expected sufficient statistics $\mu_m$, most of the information about the full posterior $p_m(z \mid x)$ must be captured in these statistics alone. Therefore, we expect our approximations to be accurate when each submodel has enough capacity to represent the posterior $p_m(z \mid x; \theta_m)$ as a low-variance unimodal distribution.

5  Applications 

We  now  empirically  validate  our  algorithms  on  three  concrete  applications:  grammar  induction  using 
product  EM  (Section  5.1),  unsupervised  word  alignment  using  domain-approximate  product  EM 
(Section  5.2),  and  prediction  of  missing  nucleotides  in  DNA  sequences  using  parameter-approximate 
product  EM  (Section  5.3). 

Figure 1: The two instances of IBM model 1 for word alignment are shown in (a) and (b). The graph (panel title: HMM model) shows gains from agreement-based learning.


5.1  Grammar  induction 

Grammar  induction  is  the  problem  of  inducing  latent  syntactic  structures  given  a  set  of  observed 
sentences.  There  are  two  common  types  of  syntactic  structure  (one  based  on  word  dependencies  and 
the  other  based  on  constituent  phrases),  which  can  each  be  represented  as  a  submodel.  [5]  proposed 
an  algorithm  to  train  these  two  submodels.  Their  algorithm  is  a  special  case  of  our  product  EM 
algorithm,  although  they  did  not  state  an  objective  function.  Since  the  shared  hidden  state  is  a  tree 
structure,  product  EM  is  tractable.  They  show  that  training  the  two  submodels  to  agree  significantly 
improves  accuracy  over  independent  training.  See  [5]  for  more  details. 

5.2  Unsupervised  word  alignment 

Word  alignment  is  an  important  component  of  machine  translation  systems.  Suppose  we  have  a  set 
of  sentence  pairs.  Each  pair  consists  of  two  sentences,  one  in  a  source  language  (say,  English)  and 
its  translation  in  a  target  language  (say,  French).  The  goal  of  unsupervised  word  alignment  is  to 
match  the  words  in  a  source  sentence  to  the  words  in  the  corresponding  target  sentence.  Formally, 
let $x = (\mathbf{e}, \mathbf{f})$ be an observed pair of sentences, where $\mathbf{e} = (e_1, \dots, e_{|\mathbf{e}|})$ and $\mathbf{f} = (f_1, \dots, f_{|\mathbf{f}|})$; $z$ is a set of alignment edges between positions in the English and positions in the French.

Classical models for word alignment include IBM models 1 and 2 [2] and the HMM model [8]. These are asymmetric models, which means that they assign non-zero probability only to alignments in which each French word is aligned to at most one English word; we denote this set $\mathcal{Z}_1$. An element $z \in \mathcal{Z}_1$ can be parametrized by a vector $\mathbf{a} = (a_1, \dots, a_{|\mathbf{f}|})$, with $a_j \in \{\text{Null}, 1, \dots, |\mathbf{e}|\}$, corresponding to the English word (if any) that French word $f_j$ is aligned to. We define the first submodel on $\mathcal{X} \times \mathcal{Z}_1$ as follows (specializing to IBM model 1 for simplicity):

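$$p_1(x, z; \theta_1) = p_1(\mathbf{e}, \mathbf{f}, \mathbf{a}; \theta_1) = p_1(\mathbf{e}) \prod_{j=1}^{|\mathbf{f}|} p_1(a_j)\, p_1(f_j \mid e_{a_j}; \theta_1), \qquad (14)$$

where $p_1(\mathbf{e})$ and $p_1(a_j)$ are constant and the canonical exponential parameters $\theta_1$ are the translation log-probabilities $\{\log t_{1;ef}\}$ for each English word $e$ (including Null) and French word $f$.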

Written in exponential family form, $\phi^Z(z)$ is an $(|\mathbf{e}| + 1)(|\mathbf{f}| + 1)$-dimensional vector whose components are $\{\phi^Z_{ij}(z) \in \{0, 1\} : i = \text{Null}, 1, \dots, |\mathbf{e}|,\ j = \text{Null}, 1, \dots, |\mathbf{f}|\}$. We have $\phi^Z_{ij}(z) = 1$ if and only if English word $e_i$ is aligned to French word $f_j$ and $\phi^Z_{\text{Null}\,j}(z) = 1$ if and only if $f_j$ is not aligned to any English word. Also, $\phi^X_{ef;ij}(x) = 1$ if and only if $e_i = e$ and $f_j = f$. The mean parameters associated with an E-step are $\{\mu_{1;ij}\}$, the posterior probabilities of $e_i$ aligning to $f_j$; these can be computed independently for each $j$. We can define a second submodel $p_2(x, z; \theta_2)$ on $\mathcal{X} \times \mathcal{Z}_2$ by reversing the roles of English and French. Figure 1(a)-(b) shows the two models.

We cannot use the product EM algorithm to train $p_1$ and $p_2$ because summing over all alignments in $\mathcal{Z}_\cap = \mathcal{Z}_1 \cap \mathcal{Z}_2$ is NP-hard. However, we can use domain-approximate product EM because $E(b_1 + b_2, \mathcal{Z}_m)$ is tractable; the tractability here depends not on decomposability of $b$ but on the asymmetric alignment structure of $\mathcal{Z}_m$. The concrete change from independent EM is slight; we need only change the E-step of each $p_m$ to use the product of translation probabilities $t_{1;ef}\, t_{2;fe}$ and change the M-step to use the average of the edge posteriors obtained from the two E-steps.
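The two changes described above can be sketched for IBM model 1 as follows. This is a minimal illustration under stated assumptions, not the authors' code: the corpus is a hypothetical four-pair toy, translation tables are plain dictionaries, and Null is scored with each model's own parameter only and counted from that model's own posterior, since the text does not spell out how Null is handled.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical token standing for the Null word

def e_step_direction(src, tgt, t_own, t_other):
    """Edge posteriors when each tgt word picks one src word or NULL.

    Edge (s, t) is scored with the product t_own[(s, t)] * t_other[(t, s)].
    Returns post[i][j] for src position i and tgt position j, plus the
    posterior that each tgt word is aligned to NULL.
    """
    post = [[0.0] * len(tgt) for _ in src]
    null_post = [0.0] * len(tgt)
    for j, w_t in enumerate(tgt):
        weights = [t_own[(s, w_t)] * t_other[(w_t, s)] for s in src]
        w_null = t_own[(NULL, w_t)]          # assumption: own parameter only
        total = sum(weights) + w_null
        for i, w in enumerate(weights):
            post[i][j] = w / total
        null_post[j] = w_null / total
    return post, null_post

def m_step(counts):
    """Normalize expected counts c[(src, tgt)] into p(tgt | src)."""
    totals = defaultdict(float)
    for (s, _), c in counts.items():
        totals[s] += c
    return {key: c / totals[key[0]] for key, c in counts.items()}

def train(pairs, iterations=10):
    t1, t2 = {}, {}   # t1[(e, f)] = p(f | e), t2[(f, e)] = p(e | f)
    for e, f in pairs:
        for fw in f:
            t1[(NULL, fw)] = 1.0
            for ew in e:
                t1[(ew, fw)] = 1.0
                t2[(fw, ew)] = 1.0
        for ew in e:
            t2[(NULL, ew)] = 1.0
    t1, t2 = m_step(t1), m_step(t2)          # uniform initialization
    for _ in range(iterations):
        c1, c2 = defaultdict(float), defaultdict(float)
        for e, f in pairs:
            mu1, null1 = e_step_direction(e, f, t1, t2)   # mu1[i][j]
            mu2, null2 = e_step_direction(f, e, t2, t1)   # mu2[j][i]
            for i, ew in enumerate(e):
                for j, fw in enumerate(f):
                    avg = 0.5 * (mu1[i][j] + mu2[j][i])   # averaged posterior
                    c1[(ew, fw)] += avg
                    c2[(fw, ew)] += avg
            for j, fw in enumerate(f):
                c1[(NULL, fw)] += null1[j]
            for i, ew in enumerate(e):
                c2[(NULL, ew)] += null2[i]
        t1, t2 = m_step(c1), m_step(c2)
    return t1, t2

pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "house"], ["une", "maison"]),
    (["house"], ["maison"]),
]
t1, t2 = train(pairs, iterations=10)
```

Setting `t_other` to a table of ones and using each model's own posterior instead of `avg` would recover independent EM for the two directions, which makes the size of the change easy to see.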

Figure 2: The two phylogenetic HMM models, one for the even slices, the other for the odd ones: (a) submodel $p_1$, (b) submodel $p_2$.


[6] proposed an alternative method to train two models to agree. Their E-step computes $\mu_1 = E(b_1, \mathcal{Z}_1)$ and $\mu_2 = E(b_2, \mathcal{Z}_2)$, whereas our E-steps incorporate the parameters of both models in $b_1 + b_2$. Their M-step uses the elementwise product of $\mu_1$ and $\mu_2$, whereas we use the average $\frac{1}{2}(\mu_1 + \mu_2)$. Finally, while their algorithm appears to be very stable and is observed to converge empirically, no objective function has been developed; in contrast, our algorithm maximizes (12). In practice, both algorithms perform comparably.

We conducted our experiments according to the setup of [6]. We used 100K unaligned sentences for training and 137 for testing from the English-French Hansards data of the NAACL 2003 Shared Task. Alignments are evaluated using alignment error rate (AER); see [6] for more details. We trained two instances of the HMM model [8] (English-to-French and French-to-English) using 10 iterations of domain-approximate product EM, initializing with independently trained IBM model 1 parameters. For prediction, we output alignment edges with sufficient posterior probability: $\{(i, j) : \frac{1}{2}(\mu_{1;ij} + \mu_{2;ij}) \ge \delta\}$. Figure 1 shows how agreement-based training improves the error rate over independent training for the HMM models.
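The prediction rule can be sketched in a few lines. The threshold name `delta`, the non-strict comparison, and the layout of the posterior matrices (both indexed as English position, then French position) are assumptions of this example; the numbers are invented.

```python
def predict_edges(mu1, mu2, delta):
    """Keep edge (i, j) when the averaged posterior reaches the threshold."""
    return {
        (i, j)
        for i, row in enumerate(mu1)
        for j, p in enumerate(row)
        if 0.5 * (p + mu2[i][j]) >= delta
    }

# Hypothetical edge posteriors from the two directional models.
mu1 = [[0.9, 0.1], [0.2, 0.7]]
mu2 = [[0.8, 0.0], [0.1, 0.9]]
edges = predict_edges(mu1, mu2, delta=0.5)
```

Because the rule thresholds each edge separately, the predicted set need not lie in either asymmetric alignment space, so a word may end up with several edges or none.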

5.3  Phylogenetic  HMM  models 

Suppose we have a set of species $s \in S$ arranged in a fixed phylogeny (i.e., $S$ are the nodes of a directed tree). Each species $s$ is associated with a length $L$ sequence of nucleotides $d_s = (d_{s1}, \dots, d_{sL})$. Let $d = \{d_s : s \in S\}$ denote all the nucleotides, which consist of some observed ones $x$ and unobserved ones $z$.

A  good  phylogenetic  model  should  take  into  consideration  both  the  relationship  between  nucleotides 
of  the  different  species  at  the  same  site  and  the  relationship  between  adjacent  nucleotides  in  the  same 
species.  However,  such  a  model  would  have  high  tree-width  and  be  intractable  to  train.  Past  work 
has  focused  on  traditional  variational  inference  in  a  single  intractable  model  [9,  4].  Our  approach  is 
to  instead  create  two  tractable  submodels  and  train  them  to  agree.  Define  one  submodel  to  be 

$$p_1(x, z; \theta_1) = p_1(d; \theta_1) = \prod_{j \text{ odd}} \prod_{s \in S} \prod_{s' \in \text{Ch}(s)} p_1(d_{s'j} \mid d_{sj}; \theta_1)\, p_1(d_{s'(j+1)} \mid d_{s'j}, d_{s(j+1)}; \theta_1), \qquad (15)$$

where $\text{Ch}(s)$ is the set of children of $s$ in the tree. The second submodel $p_2$ is defined similarly, only with the product taken over $j$ even. The parameters $\theta_m$ consist of first-order mutation log-probabilities and second-order mutation log-probabilities. Both submodels permit the same set of assignments of hidden nucleotides ($\mathcal{Z}_\cap = \mathcal{Z}_1 = \mathcal{Z}_2$). Figure 2(a)-(b) shows the two submodels.

Exact product EM is not tractable since $b = b_1 + b_2$ corresponds to a graph with high tree-width. We can apply parameter-approximate product EM, in which the E-step only involves computing $\mu_m = E(2 b_m, \mathcal{Z}_\cap)$. This can be done via dynamic programming along the tree for each two-nucleotide slice of the sequence. In the M-step, the average $\frac{1}{2}(\mu_1 + \mu_2)$ is used for each model, which has a closed form solution.

Our experiments used a multiple alignment consisting of $L = 20{,}000$ consecutive sites belonging to the L1 transposons in the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene (chromosome 7). Eight eutherian species were arranged in the phylogeny shown in Figure 3. The data we used is the same as that of [9]. Some nucleotides in the sequences were already missing. In addition, we held out some fraction of the observed ones for evaluation. We trained two models using 30 iterations of parameter-approximate product EM.³ For prediction, the posteriors over heldout nucleotides under each model are averaged and the nucleotide with the highest averaged posterior is chosen. Figure 3 shows the prediction accuracy. Though independent and agreement-based training eventually obtain the same accuracy, agreement-based training converges much faster. This gap grows as the amount of heldout data increases.

³ We initialized with a small amount of noise around uniform parameters plus a small bias towards identity mutations.

Figure 3: The tree is the phylogeny topology used in experiments. The graphs show the prediction accuracy of independent versus agreement-based training (parameter-approximate product EM) when 20% and 50% of the observed nodes are held out.

6  Conclusion 

We have developed a general framework for agreement-based learning of multiple submodels. Viewing these submodels as components of an overall model, our framework permits the submodels to be trained jointly without paying the computational cost associated with an actual jointly-normalized probability model. We have presented an objective function for agreement-based learning and three EM-style algorithms that maximize this objective or approximations to this objective. We have also demonstrated the applicability of our approach to three important real-world tasks. For grammar induction, our approach yields the existing algorithm of [5], providing an objective for that algorithm. For word alignment and phylogenetic HMMs, our approach provides entirely new algorithms.